{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "547c3bf9-1f60-4db8-a290-bca9e612c7b4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/javascript": "\ntry {\nrequire(['notebook/js/codecell'], function(codecell) {\n  codecell.CodeCell.options_default.highlight_modes[\n      'magic_text/x-csrc'] = {'reg':[/^%%microblaze/]};\n  Jupyter.notebook.events.one('kernel_ready.Kernel', function(){\n      Jupyter.notebook.get_cells().map(function(cell){\n          if (cell.cell_type == 'code'){ cell.auto_highlight(); } }) ;\n  });\n});\n} catch (e) {};\n"
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/javascript": "\ntry {\nrequire(['notebook/js/codecell'], function(codecell) {\n  codecell.CodeCell.options_default.highlight_modes[\n      'magic_text/x-csrc'] = {'reg':[/^%%pybind11/]};\n  Jupyter.notebook.events.one('kernel_ready.Kernel', function(){\n      Jupyter.notebook.get_cells().map(function(cell){\n          if (cell.cell_type == 'code'){ cell.auto_highlight(); } }) ;\n  });\n});\n} catch (e) {};\n"
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from pynq import Overlay\n",
    "overlay = Overlay(\"/home/ubuntu/workspace/pynq_bitfiles/2-28/MatMul_SA10.bit\")\n",
    "accel_ip = overlay.mmult_accel_0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebb7bda2-19a1-41d7-8d28-0033e63ddadd",
   "metadata": {},
   "source": [
    "# Core Function\n",
    "- `call_fpga()`: Handles memory management and parameter configuration for the hardware accelerator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cdcef6d1-6327-4acd-9c82-5b1d7a6c7a6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def call_fpga(A_buf, B_buf, C_buf, accel_ip, N, K, M, update_A):\n",
    "    \"\"\"\n",
    "    Runs a 2D matrix multiplication on the FPGA accelerator:\n",
    "      (N, K) x (K, M) => (N, M)\n",
    "\n",
    "    A_buf, B_buf, C_buf are PYNQ buffers allocated with shape=(N,K), (K,M), (N,M).\n",
    "    update_A: 1 to load A into BRAM (new input), 0 to reuse persistent A.\n",
    "    \"\"\"\n",
    "    # print(\"calling fpga, update_A =\", update_A)\n",
    "    \n",
    "    # Flush input buffers to ensure data consistency.\n",
    "    # Only flush A_buf if we intend to update A (update_A==1).\n",
    "    if update_A:\n",
    "        A_buf.flush()\n",
    "    B_buf.flush()\n",
    "\n",
    "    # Configure the accelerator registers\n",
    "    accel_ip.register_map.A_1 = A_buf.physical_address & 0xFFFFFFFF\n",
    "    accel_ip.register_map.A_2 = (A_buf.physical_address >> 32) & 0xFFFFFFFF\n",
    "    accel_ip.register_map.B_1 = B_buf.physical_address & 0xFFFFFFFF\n",
    "    accel_ip.register_map.B_2 = (B_buf.physical_address >> 32) & 0xFFFFFFFF\n",
    "    accel_ip.register_map.C_1 = C_buf.physical_address & 0xFFFFFFFF\n",
    "    accel_ip.register_map.C_2 = (C_buf.physical_address >> 32) & 0xFFFFFFFF\n",
    "    accel_ip.register_map.N = N\n",
    "    accel_ip.register_map.K = K\n",
    "    accel_ip.register_map.M = M\n",
    "    # Pass the update_A flag to the accelerator\n",
    "    accel_ip.register_map.update_A = update_A\n",
    "\n",
    "    # Start the accelerator\n",
    "    accel_ip.register_map.CTRL.AP_START = 1\n",
    "\n",
    "    # Wait for finish\n",
    "    while accel_ip.register_map.CTRL.AP_DONE == 0:\n",
    "        pass\n",
    "\n",
    "    # Invalidate output buffer so the CPU sees the updated data from DDR\n",
    "    C_buf.invalidate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31ce9c9a-d51f-45a2-9574-827ca1d3e0b5",
   "metadata": {},
   "source": [
    "# Helper Functions\n",
    "This block contains utility functions for FPGA-based acceleration.  \n",
    "- **call_fpga()**: Sends matrix multiplication tasks to the FPGA and retrieves results.\n",
    "- **pynq_buffer_from_numpy()**: Converts NumPy arrays to PYNQ-compatible buffers.\n",
    "- **requantize()**: Converts int32 arrays to int8 using a scaling factor and zero point.\n",
    "- **display_model_confidence()**: Converts logits to human-readable class confidence.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "25665a2f-2e09-45fc-8d92-ad87ceea071b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from pynq import allocate\n",
    "\n",
    "def pynq_buffer_from_numpy(np_array):\n",
    "    \"\"\"\n",
    "    Allocates a PYNQ buffer with the same shape and dtype as np_array,\n",
    "    then copies the data into the buffer.\n",
    "    \"\"\"\n",
    "    buf = allocate(np_array.shape, dtype=np_array.dtype)\n",
    "    np.copyto(buf, np_array)\n",
    "    return buf\n",
    "\n",
    "def requantize(int32_array, scale, zero_point=0):\n",
    "    \"\"\"\n",
    "    Requantizes an int32 numpy array to int8 using the provided scale and zero_point.\n",
    "    Operation: int8_val = clip(round(int32_val * scale + zero_point), -128, 127)\n",
    "    \"\"\"\n",
    "    scaled = np.round(int32_array * scale + zero_point)\n",
    "    int8_array = np.clip(scaled, -128, 127).astype(np.int8)\n",
    "    return int8_array\n",
    "\n",
    "def display_model_confidence(logits, device_name=\"Model\"):\n",
    "    \"\"\"\n",
    "    Converts logits to probabilities and prints a user-friendly confidence message.\n",
    "\n",
    "    Parameters:\n",
    "    logits (torch.Tensor): The raw model output (logits).\n",
    "    device_name (str): Name of the device (e.g., \"CPU\", \"FPGA\") for comparison.\n",
    "    \"\"\"\n",
    "    # Convert logits to probabilities\n",
    "    probs = torch.softmax(logits, dim=1)\n",
    "\n",
    "    # Get predicted class and confidence\n",
    "    predicted_class = torch.argmax(probs, dim=1).item()\n",
    "    confidence = probs[0, predicted_class].item() * 100\n",
    "\n",
    "    # Print result\n",
    "    print(f\"{device_name}: The model is {confidence:.2f}% confident in predicting class {predicted_class}.\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99268799-d711-4d89-a410-b7ebecd98f43",
   "metadata": {},
   "source": [
    "# Custom Module for FPGA Offload\n",
    "#### FPGA-Optimized Linear Layer for Q, K, V Projections\n",
    "This block defines **FPGAQuantizedLinear**, a custom PyTorch module that replaces  \n",
    "the standard linear layers with FPGA-accelerated equivalents. It:\n",
    "- **Quantizes activations** before computation.\n",
    "- **Uses PYNQ buffers** to store inputs and weights.\n",
    "- **Invokes the FPGA accelerator** for matrix multiplications.\n",
    "- **Dequantizes the result** back to floating point.\n",
    "This module is later integrated into DistilBERT for hardware acceleration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a8ab79b8-7967-45bf-be6f-8d05b2115576",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import numpy as np\n",
    "\n",
    "class FPGAQuantizedLinear(torch.nn.Module):\n",
    "    def __init__(self, quantized_linear, act_scale, accel_ip, hidden_size=768, update_A=True):\n",
    "        \"\"\"\n",
    "        Parameters:\n",
    "          quantized_linear : an instance of DynamicQuantizedLinear from the quantized model.\n",
    "          act_scale        : scaling factor for quantizing input activations.\n",
    "          accel_ip         : the FPGA accelerator IP handle.\n",
    "          hidden_size      : hidden dimension size (typically 768).\n",
    "          update_A         : flag indicating whether to update A in persistent BRAM (True for Q, False for K/V).\n",
    "        \"\"\"\n",
    "        total_fpga_compute_time = 0.0\n",
    "        call_count = 0\n",
    "        \n",
    "        super(FPGAQuantizedLinear, self).__init__()\n",
    "        self.accel_ip = accel_ip\n",
    "        self.hidden_size = hidden_size\n",
    "        self.act_scale = act_scale\n",
    "        self.update_A = update_A  # Store the update flag\n",
    "        \n",
    "        # Extract quantized weight and its parameters.\n",
    "        self.weight_int8_tensor = quantized_linear.weight().int_repr()\n",
    "        self.weight_scale = quantized_linear.weight().q_scale()\n",
    "        self.weight_zero_point = quantized_linear.weight().q_zero_point()\n",
    "        # Transpose so that the weight shape becomes (in_features, out_features)\n",
    "        self.weight_int8 = self.weight_int8_tensor.cpu().numpy().T  # shape: (hidden_size, hidden_size)\n",
    "        \n",
    "        # Effective scale: multiplication of activation scale and weight scale.\n",
    "        self.effective_scale = self.act_scale * self.weight_scale\n",
    "        \n",
    "        # Check for bias. Note that in DynamicQuantizedLinear, bias remains in FP32.\n",
    "        bias_val = quantized_linear.bias()  # This calls the bound method.\n",
    "        if bias_val is not None:\n",
    "            # Save bias as a NumPy array (shape: (hidden_size,))\n",
    "            self.bias = bias_val.detach().cpu().numpy().astype(np.float32)\n",
    "        else:\n",
    "            self.bias = None\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"\n",
    "        Forward pass for FPGA offload.\n",
    "        Accepts input x which may be 2D (N, D) or 3D (B, S, D). In case of 3D input,\n",
    "        the tensor is reshaped to 2D for matrix multiplication and then reshaped back.\n",
    "        The input is quantized to int8 using self.act_scale. After the FPGA multiplication,\n",
    "        the int32 result is dequantized to FP32 and the bias is added (if available).\n",
    "        \"\"\"\n",
    "        # Save the original shape.\n",
    "        orig_shape = x.shape\n",
    "        if x.dim() == 3:\n",
    "            B, S, D = x.shape\n",
    "            x_flat = x.reshape(B * S, D)\n",
    "        else:\n",
    "            x_flat = x\n",
    "\n",
    "        # Determine the number of rows for the FPGA call.\n",
    "        N = x_flat.shape[0]\n",
    "\n",
    "        # Quantize the input if it is in float32.\n",
    "        if x_flat.dtype == torch.float32:\n",
    "            x_int8 = torch.clamp(torch.round(x_flat / self.act_scale), -128, 127).to(torch.int8)\n",
    "        else:\n",
    "            x_int8 = x_flat\n",
    "\n",
    "        # Convert to a NumPy int8 array.\n",
    "        x_np = x_int8.cpu().numpy().astype(np.int8)\n",
    "        \n",
    "        # Convert input activation and weight to PYNQ buffers.\n",
    "        A_buf = pynq_buffer_from_numpy(x_np)\n",
    "        W_buf = pynq_buffer_from_numpy(self.weight_int8)\n",
    "        # Allocate an output buffer for the int32 result (shape: (N, hidden_size))\n",
    "        C_buf = allocate((N, self.hidden_size), dtype=np.int32)\n",
    "        \n",
    "        # Call the FPGA accelerator:\n",
    "        # Instead of hardcoding update_A=1, we now use self.update_A:\n",
    "        # Time just the FPGA computation\n",
    "        start_fpga = time.time()\n",
    "        call_fpga(A_buf, W_buf, C_buf, self.accel_ip, N, self.hidden_size, self.hidden_size, update_A=int(self.update_A))\n",
    "        fpga_duration = time.time() - start_fpga\n",
    "        \n",
    "        FPGAQuantizedLinear.total_fpga_compute_time += fpga_duration\n",
    "        FPGAQuantizedLinear.call_count += 1\n",
    "        \n",
    "        # Retrieve the int32 result.\n",
    "        C_int32 = np.array(C_buf)\n",
    "        # Dequantize: convert int32 accumulator to FP32 using the effective scale.\n",
    "        out_fp32 = C_int32.astype(np.float32) * self.effective_scale\n",
    "        \n",
    "        # If a bias is present, add it (broadcast along axis 0).\n",
    "        if self.bias is not None:\n",
    "            # Ensure bias is added to each row.\n",
    "            out_fp32 = out_fp32 + self.bias\n",
    "        \n",
    "        # Convert back to a torch tensor.\n",
    "        out_tensor = torch.tensor(out_fp32, dtype=torch.float32)\n",
    "        \n",
    "        # If the original input was 3D, reshape back to (B, S, hidden_size).\n",
    "        if x.dim() == 3:\n",
    "            out_tensor = out_tensor.reshape(B, S, self.hidden_size)\n",
    "        return out_tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cbb0937-70a5-48f9-b85f-05abe5f02d19",
   "metadata": {},
   "source": [
    "# Replacing Q, K, V Layers with FPGA Versions\n",
    "This function walks through all transformer layers in the quantized DistilBERT model  \n",
    "and replaces the **Q, K, and V projection layers** with the custom **FPGAQuantizedLinear** module.\n",
    "- Ensures **Q projection updates A in BRAM** (update_A=True).\n",
    "- **K and V projections reuse A** for efficiency.\n",
    "- Enables model acceleration while preserving transformer layer structure.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "eece722d-3257-4872-9f48-600b16cc2433",
   "metadata": {},
   "outputs": [],
   "source": [
    "def integrate_fpga_offload(model_quant, act_scale, accel_ip, hidden_size=768):\n",
    "    \"\"\"\n",
    "    Replaces the Q, K, V projection layers in each transformer layer with the FPGA-accelerated custom module.\n",
    "    \n",
    "    Parameters:\n",
    "      model_quant  : Quantized DistilBertForSequenceClassification model.\n",
    "      act_scale    : Scaling factor for quantizing activations (assumed same for demo).\n",
    "      accel_ip     : Configured FPGA accelerator IP handle.\n",
    "      hidden_size  : Hidden dimension (typically 768).\n",
    "    \"\"\"\n",
    "    for layer in model_quant.distilbert.transformer.layer:\n",
    "        # For the Q projection, set update_A to True so that the persistent A is updated.\n",
    "        layer.attention.q_lin = FPGAQuantizedLinear(layer.attention.q_lin, act_scale, accel_ip, hidden_size, update_A=True)\n",
    "        # For K and V projections, set update_A to False to reuse A from BRAM.\n",
    "        layer.attention.k_lin = FPGAQuantizedLinear(layer.attention.k_lin, act_scale, accel_ip, hidden_size, update_A=False)\n",
    "        layer.attention.v_lin = FPGAQuantizedLinear(layer.attention.v_lin, act_scale, accel_ip, hidden_size, update_A=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1bc067c9-f8eb-4b46-8aec-fbc5df161aea",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def compute_activation_scale(activation_list, percentile=99.9, use_demo=0):\n",
    "    \"\"\"\n",
    "    Computes a global activation scale from a calibration set of activations.\n",
    "    \n",
    "    Parameters:\n",
    "      activation_list: List of NumPy arrays representing activations \n",
    "                       (for example, from the embedding layer).\n",
    "      percentile:      The percentile to use for robust scale computation (if use_demo=0).\n",
    "      use_demo:        If set to 1, uses the demo method (scale = max_abs_value/127.0);\n",
    "                       otherwise, uses the robust method (scale = percentile_value/127.0).\n",
    "    \n",
    "    Returns:\n",
    "      A scaling factor computed as:\n",
    "         - Demo method: scale = (max(|activations|)) / 127.0\n",
    "         - Robust method: scale = (percentile(|activations|)) / 127.0\n",
    "    \"\"\"\n",
    "    # Concatenate all activations from the calibration samples into one array.\n",
    "    all_activations = np.concatenate([act.flatten() for act in activation_list])\n",
    "    \n",
    "    if use_demo:\n",
    "        # Demo method: use the maximum absolute value.\n",
    "        act_abs_max = np.max(np.abs(all_activations))\n",
    "        scale = act_abs_max / 127.0 if act_abs_max != 0 else 1.0\n",
    "    else:\n",
    "        # Robust method: use the specified percentile.\n",
    "        act_abs_percentile = np.percentile(np.abs(all_activations), percentile)\n",
    "        scale = act_abs_percentile / 127.0 if act_abs_percentile != 0 else 1.0\n",
    "    \n",
    "    return scale\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "850e9e58-20e7-4c5f-ba65-63b500df50f1",
   "metadata": {},
   "source": [
    "# Example Usage – Custom Forward Pass Integration\n",
    "This block demonstrates how to:\n",
    "1. **Load and quantize a DistilBERT model**.\n",
    "2. **Extract activations** from the embedding layer.\n",
    "3. **Integrate FPGA acceleration** into transformer layers.\n",
    "4. **Run a forward pass** through the modified model.\n",
    "Only the **Q, K, and V projections** are offloaded to FPGA; the remaining layers run on CPU/GPU.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "515ef581-29af-467a-9375-83eaf32e3fbb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Global Activation Scale (Demo): 0.06596306177574819\n",
      "input = 'The moonlight shimmered over the ocean as waves gently kissed the sandy shore, while distant lanterns flickered in the cool evening breeze. A lone traveler wandered along the coastline, footsteps sinking softly into the damp sand, lost in thought. The rhythmic sound of the water mixed with the rustling palms, creating a nice'\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "from transformers import DistilBertTokenizer, DistilBertForSequenceClassification\n",
    "\n",
    "# Assume call_fpga() is already defined and accel_ip is configured on your KV260.\n",
    "# For example:\n",
    "# accel_ip = get_accel_ip_handle()   # <-- user-specific setup\n",
    "\n",
    "# 1. Load and Quantize the Model\n",
    "model_name = \"distilbert-base-uncased-finetuned-sst-2-english\"\n",
    "tokenizer = DistilBertTokenizer.from_pretrained(model_name)\n",
    "model = DistilBertForSequenceClassification.from_pretrained(model_name)\n",
    "model.eval()\n",
    "\n",
    "# Apply dynamic quantization to convert Linear layers to int8.\n",
    "model_int8 = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)\n",
    "model_int8.eval()\n",
    "\n",
    "# 2. Gather a Calibration Set of Activations to Compute a Global Activation Scale\n",
    "calib_sentences = [\n",
    "    \"The moonlight shimmered over the ocean as waves gently kissed the sandy shore, while distant lanterns flickered in the cool evening breeze. A lone traveler wandered along the coastline, footsteps sinking softly into the damp sand, lost in thought. The rhythmic sound of the water mixed with the rustling palms, creating a nice\"\n",
    "]\n",
    "calib_activations = []\n",
    "for sentence in calib_sentences:\n",
    "    inputs = tokenizer(sentence, return_tensors=\"pt\")\n",
    "    with torch.no_grad():\n",
    "        # Get the embedding output; shape: (B, L, 768). Here B=1.\n",
    "        emb = model.distilbert.embeddings(inputs.input_ids)  # shape: (1, L, 768)\n",
    "        # Remove the batch dimension.\n",
    "        emb = emb.squeeze(0)  # shape: (L, 768)\n",
    "        calib_activations.append(emb.cpu().numpy())\n",
    "\n",
    "# Compute the activation scale using the robust method (percentile-based):\n",
    "# global_act_scale_robust = compute_activation_scale(calib_activations, percentile=99.9, use_demo=0)\n",
    "# print(\"Global Activation Scale (Robust):\", global_act_scale_robust)\n",
    "\n",
    "# # Compute the activation scale using the demo method (max-based):\n",
    "global_act_scale_demo = compute_activation_scale(calib_activations, use_demo=1)\n",
    "print(\"Global Activation Scale (Demo):\", global_act_scale_demo)\n",
    "\n",
    "test_sentence = calib_sentences[0]\n",
    "print(f\"input = '{test_sentence}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df1c0932-7e03-42a8-bbf0-309290310327",
   "metadata": {},
   "source": [
    "# FPGA vs. CPU Inference Benchmarking\n",
    "\n",
    "### **Objective**\n",
    "This block measures and compares the inference performance of **CPU-only vs. FPGA-accelerated execution**  \n",
    "using a **quantized DistilBERT model** for text classification.\n",
    "\n",
    "### **Steps & Key Operations**\n",
    "1. **Run Inference on CPU**\n",
    "   - Tokenizes the input sentence and processes it on the CPU.\n",
    "   - Captures inference time for PyTorch execution (`cpu_time`).\n",
    "   - Extracts and stores CPU-based logits (`logits_cpu`).\n",
    "\n",
    "2. **Enable FPGA Offloading**\n",
    "   - Replaces **Q, K, V projections** with FPGA-accelerated versions.\n",
    "   - Resets FPGA timing counters before inference.\n",
    "   - Runs inference on the FPGA-accelerated model (`model_int8`).\n",
    "   - Captures **total FPGA computation time** and **average FPGA call duration**.\n",
    "\n",
    "3. **Accuracy & Performance Comparison**\n",
    "   - **Computes absolute differences** between CPU and FPGA logits.\n",
    "   - Applies **softmax** to logits to determine class probabilities.\n",
    "   - Extracts **predicted class & confidence scores** for both CPU and FPGA.\n",
    "   - Displays model confidence in an **easy-to-read format**.\n",
    "\n",
    "### **Performance Metrics Reported**\n",
    "✅ **CPU Inference Time** (Baseline execution time)  \n",
    "✅ **FPGA Compute Time** (Total and per-call breakdown)  \n",
    "✅ **Speedup Analysis** (CPU vs. FPGA execution time)  \n",
    "✅ **Accuracy Check** (Max and Mean difference between CPU and FPGA logits)  \n",
    "✅ **Predicted Class & Confidence Scores** (to validate inference consistency)  \n",
    "\n",
    "This block provides a **detailed comparison** of **inference speed, accuracy, and prediction confidence**  \n",
    "between **CPU and FPGA-accelerated execution**.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "901297f5-4db5-472e-a3bb-5d250af638c6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU Inference Time: 1.141709 seconds\n",
      "FPGA calls: 18\n",
      "Total time in FPGA compute: 0.430758 seconds\n",
      "Average time per FPGA call: 0.023931 seconds\n",
      "CPU: The model is 99.95% confident in predicting class 1.\n",
      "FPGA: The model is 99.80% confident in predicting class 1.\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "# CPU-only Inference\n",
    "inputs = tokenizer(test_sentence, return_tensors=\"pt\")\n",
    "\n",
    "start_time = time.time()\n",
    "with torch.no_grad():\n",
    "    outputs_cpu = model(inputs.input_ids)\n",
    "    logits_cpu = outputs_cpu.logits\n",
    "cpu_time = time.time() - start_time\n",
    "\n",
    "print(f\"CPU Inference Time: {cpu_time:.6f} seconds\")\n",
    "# print(\"CPU Logits:\", logits_cpu)\n",
    "\n",
    "# FPGA-Offloaded Inference\n",
    "integrate_fpga_offload(model_int8, global_act_scale_demo, accel_ip, hidden_size=768)\n",
    "\n",
    "# Reset the timing counters before inference\n",
    "FPGAQuantizedLinear.total_fpga_compute_time = 0.0\n",
    "FPGAQuantizedLinear.call_count = 0\n",
    "\n",
    "# Run inference normally with your existing code\n",
    "with torch.no_grad():\n",
    "    outputs_fpga = model_int8(inputs.input_ids)\n",
    "    logits_fpga = outputs_fpga.logits\n",
    "\n",
    "# After inference, report the detailed timing\n",
    "print(f\"FPGA calls: {FPGAQuantizedLinear.call_count}\")\n",
    "print(f\"Total time in FPGA compute: {FPGAQuantizedLinear.total_fpga_compute_time:.6f} seconds\")\n",
    "print(f\"Average time per FPGA call: {FPGAQuantizedLinear.total_fpga_compute_time/FPGAQuantizedLinear.call_count:.6f} seconds\")\n",
    "# print(f\"Adjusted speedup (FPGA compute only): {cpu_time / FPGAQuantizedLinear.total_fpga_compute_time:.2f}x\")\n",
    "\n",
    "# Compute differences\n",
    "diff = logits_cpu - logits_fpga\n",
    "max_diff = diff.abs().max().item()\n",
    "mean_diff = diff.abs().mean().item()\n",
    "\n",
    "# Compute softmax probabilities\n",
    "probs_cpu = torch.softmax(logits_cpu, dim=1)\n",
    "probs_fpga = torch.softmax(logits_fpga, dim=1)\n",
    "\n",
    "# Get predicted class and confidence\n",
    "predicted_class_cpu = torch.argmax(probs_cpu, dim=1).item()\n",
    "confidence_cpu = probs_cpu[0, predicted_class_cpu].item() * 100\n",
    "\n",
    "predicted_class_fpga = torch.argmax(probs_fpga, dim=1).item()\n",
    "confidence_fpga = probs_fpga[0, predicted_class_fpga].item() * 100\n",
    "\n",
    "# Print results\n",
    "# print(f\"Max Logits Difference: {max_diff:.6f}\")\n",
    "# print(f\"Mean Logits Difference: {mean_diff:.6f}\")\n",
    "\n",
    "display_model_confidence(logits_cpu, device_name=\"CPU\")\n",
    "display_model_confidence(logits_fpga, device_name=\"FPGA\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4a107cb-0539-46a8-8a1c-1d77fc55715c",
   "metadata": {},
   "source": [
    "## **Final Reflections: FPGA Offloading vs. PyTorch Inference**\n",
    "\n",
    "### **Key Takeaways**\n",
    "1. **FPGA Delivers Significant Computational Speedup, But System Bottlenecks Persist**  \n",
    "   - Our **benchmark shows a 7.01× speedup over PyTorch and a 214.24× speedup over NumPy** for a single **(64, 768) × (768, 3072) MatMul** operation.  \n",
    "   - Despite this raw speedup, **end-to-end inference time reduction remains limited** due to system-level inefficiencies such as data transfer overhead and synchronization delays.\n",
    "\n",
    "2. **Challenges We Encountered**\n",
    "   - **Data Transfer Overhead Dominates Latency**  \n",
    "     - Moving activations and weights **between CPU and FPGA introduces significant latency**, partially negating the speedup from FPGA computation.  \n",
    "   - **CPU-FPGA Synchronization Delays Reduce Efficiency**  \n",
    "     - The CPU often **idles while waiting for FPGA execution**, rather than executing operations in parallel.\n",
    "   - **QKV Projection Alone Does Not Account for a Large Portion of Total Compute**  \n",
    "     - While FPGA significantly accelerates MatMul, **QKV projection alone does not contribute enough to total inference time to yield major speedup**.\n",
    "   - **FPGA Start-Up & Control Overhead Adds Unavoidable Delays**  \n",
    "     - Register setup, memory flushing, and synchronization introduce additional latency before computation even starts.\n",
    "   - **PyTorch’s Highly Optimized GEMM Kernels Reduce the Acceleration Gap**  \n",
    "     - PyTorch’s **int32 GEMM (MKL-DNN/TensorRT)** is already highly efficient, making it harder to achieve dramatic acceleration unless we offload a larger portion of the workload.\n",
    "\n",
    "---\n",
    "\n",
    "### **Lessons Learned and Future Considerations**\n",
    "#### **1. Reducing Data Transfer Overhead is Crucial**\n",
    "- To unlock **FPGA’s full potential**, future work should minimize data movement **between CPU and FPGA**.\n",
    "- Using more **persistent FPGA memory** for activations (like what we did for input embedding) and implementing **DMA (if available)** could further reduce transfer latency.\n",
    "\n",
    "#### **2. Improving CPU-FPGA Execution Pipelining**\n",
    "- **Overlapping CPU and FPGA execution** is essential—while the FPGA computes QKV projection, the CPU should process **attention softmax or FFN layers** in parallel.\n",
    "- **Double buffering techniques** can ensure that **data is transferred while computation is ongoing**, reducing idle time, but it requrire careful timing design otherwise more latency will be introduced.\n",
    "\n",
    "#### **3. Expanding FPGA Workload Beyond QKV**\n",
    "- The **Feed-Forward Network (FFN) layer**, which consists of **(64, 3072) × (3072, 768) MatMul**, is computationally expensive and an ideal candidate for FPGA acceleration.\n",
    "- **Self-attention softmax computation** could also be offloaded to FPGA to further reduce CPU workload and improve overall efficiency.\n",
    "\n",
    "#### **4. Optimizing FPGA Utilization and Scaling**\n",
    "- **Parallelizing Q, K, and V projections** on separate systolic arrays could further improve efficiency.\n",
    "- Keeping the **FPGA accelerator active between layers**, rather than resetting registers and flushing memory for each computation, would reduce unnecessary overhead.\n",
    "\n",
    "---\n",
    "\n",
    "### **Final Thoughts**\n",
    "This project has **successfully validated the power of FPGA acceleration for deep learning workloads**, demonstrating a **7× speedup over PyTorch CPU execution** at the **MatMul level**. However, we have also seen firsthand that **raw computation speed is only part of the equation**—**data movement, synchronization, and system integration play an equally critical role** in achieving real-world performance gains.\n",
    "\n",
    "For future FPGA acceleration research, a **holistic approach** that integrates **both compute and system-level optimizations** will be necessary to fully harness the hardware’s potential. Our findings serve as a **guiding reference for future efforts**, emphasizing that true performance gains require a **balance of raw compute acceleration and efficient system-level execution**.\n",
    "\n",
    "While this marks the completion of our current project, we hope that the insights gained here will pave the way for more **efficient and scalable deep learning inference on FPGA**. 🚀\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
